Project 1¶

Student name: Peng Li

Student ID: 24008144

Hi Teo,

may because of my computer setting problem (Chinese system), I can not directly read the "csv" type file into python with my computer, so I changed this file into "xlsx" type and put it into the "notebook" for reading it into python. So maybe you need to use this way to check my assginment.Many thanks.

Introduction¶

We have dataset covers socio-economic data on New Zealand, stretching back to early 1980s. The data covers a range of topics: income and wealth distribution, poverty and deprivation levels, health measures, education outcomes, safety and security, housing as well as employment. There are approximately ~100 columns in the dataset. I will have data wrangling, exploratory data analysis (EDA)/data visualisation as well as detailed data analysis to uncover insights from a real-world dataset. And will try to figure out valuable conclusions and provide recommendations to policy makers for the future based on the analysis results.

Datasets used¶

1.shared_prosperity_assignment_dataset_mangled

2.Data documentation

Dataset sources: please see the "Data documentation"

Research Questions¶

1.There are 8 variable groups. For each variable group there are also many variables, so can we analyze and sub-group these variables in each group to find co-relations and potenial reasons for trends of variables, and figure out valuable conclusions?

2.Through investigation of the corelation between different vaiable groups, can help us to find potential reasons and consequences of the changes in each group?

Executive Summary¶

1.Primary data wranging to prepare for the data analysis;

2.EDA and data Visulisation to get first impression and ideas of the data and their trend/distribution; Further data wranging;

3.Get the assumption based on EDA and visulisation. And generally define the analysis direction and methodology;

4.Deeper analysis based on step 3 plan, summarize the findings to support and adjust the assumption;

5.Analysis conclusion summary, and recommnedations for the policy makers for the future.

Data Wrangling¶

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="ticks")
from pylab import rcParams
import pylab
%matplotlib inline
matplotlib.style.use('ggplot')
# Set some Pandas options as you like
pd.set_option('display.notebook_repr_html', True)
#this line enables the plots to be embedded into the notebook
rcParams['figure.figsize'] = 15, 10
rcParams['font.size'] = 20
rcParams['axes.facecolor'] = 'white'
pd.set_option('display.max_columns', None)

1. Primary data wrangling for obvious wrong or missing data:¶

-> Read the excel form (my computer can not read the "csv" type file and changed this file into "xlsx" type ) and replace '???' ,"-" and "null" to "None"

-> Sort by years

-> Delete all the '[a-zA-Z]' types in the form

-> Replace"\"

-> Replace '5.7\V' types

-> Change "" to NaN

-> Change "," in the number to "."

In [2]:
# replace '???' ,"-"  and "null" to "None"
df = pd.read_excel('shared_prosperity_assignment_dataset_mangled.xlsx',na_values=['???',"-","None"])
# sort by years
df = df.sort_values(by='year')
# delete all the '[a-zA-Z]' types in the form
df = df.replace(to_replace=r'[a-zA-Z]', value='', regex=True)
# Replace"\"
df = df.replace(to_replace=("\\"), value=np.nan)
# Replace '5.7\V' types
df = df.applymap(lambda x: str(x).split('\\')[0])
# Change "" to NaN
df.replace(" ", "", inplace=True)
df.replace("", np.nan, inplace=True)
# Change "," in the number to "."
for col in df.columns:
    df[col] = df[col].str.replace(',', '.')
df
Out[2]:
Q5:Q1 D10:D1 D10:D1-4(Palma) P90:P10_bhc P80:P20_bhc P80:P50_bhc P50:P20_bhc GINI-BHC top_10_perc_wealth_share top_5_perc_wealth_share top_1_perc_wealth_share UN_middle_class_def_inc_share OECD_middle_class_def_inc_share LIH_B60_contemp_median_AHC LIH_B50_contemp_median_AHC LIH_B50_CV_AHC LIH_B60_CV_AHC LIH_B40_contemp_median_AHC LIEH_B60_contemp_median_AHC LIEH_B50_contemp_median_AHC LIEH_B50_CV_median_AHC LIEH_B60_CV_median_AHC LIS65H_B60_CV_median_AHC LISPH_B60_CV_median_AHC LIS65H_B50_CV_median_AHC LISPH_B50_CV_median_AHC child_poverty_ahc_perc_HH_B50_CV child_poverty_ahc_perc_HH_B60_CV child_poverty_ahc_perc_HH_B40_contemp_median child_poverty_ahc_perc_HH_B50_contemp_median child_poverty_ahc_perc_HH_B60_contemp_median child_poverty_ahc_perc_HH_B60_CV_one_or_more_FT_working_parents child_poverty_ahc_perc_HH_B60_CV_one_or_more_PT_working_parents rate_of_personal_insolvencies loan_delinquency_percent national_house_rental_A30%_income_spend national_house_purchasing_A30%_income_spend national_house_rental_HAM national_house_purchasing_HAM home_ownership_rate homelessness_percentage_of_population priority_A_state_housing_applicants_as_pop_percent priority_B_state_housing_applicants_as_pop_percent debt_servicing_as_percent_of_disp_inc house_median_multiples unemployment_rate 60_64_years_unemp_rate 65years_and_over_unemp_rate underemployment_rate employees_working_long_hours labour_market_insecurity long_term_unemployment_rate pop_15_to_29_NEET low_pay_percent_OECD_def low_pay_percent_120percent_MW_def living_min_wage_gap_magnitude LIS_ms-16 LP:RPW_ms-16 teriary_education_participation_rate education_spend_as_percent_of_GDP education_spend_as_percent_of_total_gov_expenses median_teriary_loan_as_perc_of_med_inc_AHC median_teriary_loan_balance_as_perc_of_med_inc_AHC university_affordability_ratio_to_mean_salary polytechnics_affordability_ratio_to_mean_salary wananga_affordability_ratio_to_mean_salary bachelors_earning_premium_hourly dips_cert_earning_premium_hourly school_earning_premium_hourly bachelors_earning_premium_weekly dips_cert_earning_premium_weekly school_earning_premium_weekly health_expenditure_as_percent_of_gdp health_expenditure_per_capita_PPP depression_adult excellent_very_good_or_good_self_rated_health psychological_distress_adult mood_anxiety_disorder_adult healthy_weight_adult unmet_after_hours_care_due_to_cost_adult unmet_GP_care_due_to_cost_adult veg_and_fruit_intake_adult breakfast_eaten_at_home_less_5_days_a_week_child emotional_behavioural_problems_child adult_diabetes_rates depression_child excellent_very_good_or_good_parent_rated_health_child unfilled_prescription_due_to_cost_child unmet_need_for_after_hours_due_to_cost_child unmet_need_for_GP_due_to_cost_child vegetable_and_fruit_intake_child healthy_weight_child suicides_per_100000 problem_gambling_intervention_prevelance_percent total_prisoners_in_remand_rate total_sentenced_prisoners_rate total_post_sentence_offender_population_rate violent_crime_victimisations_rate recorded_murders_and_homicides_per_million regional_gdp_proportional_variation difference_in_percent_for_low_income_by_gender gender_pay_gap year
24 4.13 6.15 0.91 3.25 2.32 1.51 0.65 27.2 nan nan nan 22.2 63.3 9.0 6.0 8.0 nan 4.0 3.0 1.0 3.0 nan nan nan nan nan 11.0 nan 6.0 9.0 14.0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan NaN nan nan NaN nan NaN nan nan nan nan nan nan nan NaN nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1982-12-31 00:00:00
33 -1.5 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan NaN NaN nan NaN nan nan nan nan NaN nan nan nan nan nan nan nan nan NaN nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1983-12-31 00:00:00
22 nan nan nan 3.31 2.29 1.53 0.67 27.5 nan nan nan 22.3 62.5 9.0 6.0 9.0 nan 4.0 2.0 1.0 2.0 nan nan nan 1.11111111111111 3.0 14.0 nan 6.0 10.0 15.0 56.0 3.0 nan nan nan nan nan nan nan nan nan nan NaN nan nan nan NaN nan nan nan NaN nan nan nan nan nan nan NaN nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1984-12-31 00:00:00
16 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan NaN nan NaN nan nan nan NaN nan nan nan nan NaN nan NaN nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 9.9951191236967 nan nan nan nan nan nan nan nan nan 1985-12-31 00:00:00
9 4.04 6.06 0.91 3.2 2.19 1.48 0.68 27.0 nan nan nan 22.8 64.6 7.0 5.0 7.0 nan 3.0 4.0 2.0 4.0 nan nan nan 1.42857142857142 3.14285714285714 11.0 nan 5.0 7.0 11.0 62.0 2.0 nan nan nan nan nan nan nan nan nan nan nan NaN 4.2 1.3 2 nan nan nan 7.93319426 nan nan nan nan nan nan NaN nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 12.2976089403982 nan nan nan nan nan nan nan nan nan 1986-12-31 00:00:00
4 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan NaN NaN nan 4.2 1.3 1.7 nan nan nan 10.56751469 nan nan nan nan NaN nan NaN nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 13.5716733442562 nan nan nan nan nan nan nan nan nan 1987-12-31 00:00:00
10 nan nan nan 3.11 2.24 1.49 0.67 27.1 nan nan nan 22.4 63.7 10.0 6.0 9.0 nan 4.0 6.0 2.0 5.0 nan nan nan 1.33333333333333 1.66666666666666 12.0 nan 5.0 8.0 13.0 61.0 5.0 nan nan nan nan nan nan nan nan nan nan nan nan 5.8 2.3 2.3 NaN nan nan 13.48464633 nan nan nan nan NaN NaN NaN nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 13.9263361154962 nan nan nan nan nan nan nan 0.0 nan 1988-12-31 00:00:00
26 nan nan nan nan nan nan nan nan nan nan nan 22.1 61.9 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 7.3 2.3 3.4 nan nan nan 17.44897948 NaN nan NaN nan nan NaN NaN nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 13.3658525452438 nan nan nan nan nan nan nan nan nan 1989-12-31 00:00:00
1 4.46 6.35 1.1 3.43 2.42 1.6 0.66 30.2 nan nan nan 21.3 59.5 11.0 6.0 10.0 nan 4.0 6.0 2.0 6.0 nan nan nan 1.5 2.5 15.0 nan 5.0 7.0 16.0 57.0 6.0 nan nan nan nan nan nan nan NaN nan nan nan nan 8 3.3 2.1 nan nan nan 22.06366623 nan nan nan nan nan NaN NaN nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 12.9727691986114 nan nan nan nan nan nan nan 1.0 nan 1990-12-31 00:00:00
20 nan nan nan nan nan nan nan 101.5 nan nan nan 20.2 57.5999999999999 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 73.2978318752377 nan nan nan nan nan 10.6 2.8 2.5 nan nan NaN 24.32269694 nan nan NaN nan nan nan NaN nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 13.7012554208811 nan nan nan nan nan nan nan nan nan 1991-12-31 00:00:00
21 nan nan nan 3.8 2.53 1.64 0.65 31.9 nan nan nan 20.0 56.0999999999999 17.0 11.0 20.0 nan 7.0 3.0 1.0 6.0 nan nan nan 1.5 3.45 32.0 nan 9.0 17.0 27.0 34.0 6.0 nan nan nan nan nan nan 72.6664662558244 nan nan nan nan nan 10.7 2.7 1.7 nan nan nan 32.37237262 NaN nan nan NaN nan nan NaN nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 14.1274508971636 nan nan nan nan nan nan nan 1.0 nan 1992-12-31 00:00:00
25 nan nan nan nan nan nan nan nan nan nan nan 19.7 54.3 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 72.0415738678545 nan nan nan nan NaN 9.8 2.5 1.7 nan nan nan 33.7628865 nan NaN NaN nan nan nan NaN nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 12.458383791355 nan nan nan nan nan nan nan nan nan 1993-12-31 00:00:00
0 5.09 8.03 1.21 3.87 2.52 1.66 0.66 32.2 nan nan nan 19.9 54.9 19.0 13.0 22.0 nan 7.0 3.0 1.0 7.0 nan nan nan 1.36363636363636 3.27272727272727 34.0 nan 10.0 20.0 29.0 36.0 10.0 nan nan nan nan nan nan 71.4222938333089 nan nan nan nan NaN 8.4 3.3 2.3 nan nan NaN 32.93502612 nan nan nan nan nan NaN 7.3 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 14.1483858234262 nan nan nan nan 8.18110779765849 17.1195051910757 nan 2.0 nan 1994-12-31 00:00:00
18 nan nan nan nan nan nan nan nan nan nan nan 19.8 56.0999999999999 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 70.7897583844212 nan nan nan NaN nan 6.5 2.7 1.5 nan nan nan 25.84053796 nan nan nan NaN nan nan 8.1 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 14.9690007098353 nan nan nan nan 8.30902116546058 10.881984874041 nan nan nan 1995-12-31 00:00:00
31 nan nan nan 3.79 2.55 1.66 0.65 33.1 nan nan nan 19.8 55.2 18.0 13.0 20.0 nan 8.0 6.0 3.0 7.0 nan nan nan 1.45 3.7 31.0 nan 12.0 20.0 28.0 39.0 9.0 nan nan nan nan nan nan 70.2044583274173 NaN nan NaN nan nan 6.3 2.9 1.4 nan nan nan 21.16650992 nan nan nan NaN 0.573814338648227 0.982628356529864 8.200000000000001 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 14.6565329456752 nan nan nan nan 8.33383325745199 14.7371692845238 nan 3.0 nan 1996-12-31 00:00:00
19 nan nan nan nan nan nan nan nan nan nan nan 18.9 54.4 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1.80253994264645 nan nan nan nan 69.6332518337408 nan nan nan nan nan 6.8 3.3 1 NaN NaN nan 19.72672923 nan 13.27024441 nan NaN 0.57740251663902 0.974400010975234 8.1 4.8 15.4 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 14.8084646746629 nan nan nan nan 8.12861523819595 17.4532068945456 nan nan 0.0 1997-12-31 00:00:00
11 5.26 8.66 1.31 3.68 2.59 1.65 0.64 33.0 nan nan nan 19.6 54.9 18.0 13.0 18.0 nan 9.0 9.0 4.0 8.0 nan nan nan 1.22222222222222 3.44444444444444 27.0 nan 14.0 20.0 28.0 40.0 11.0 nan 3.99920807760839 nan nan nan nan 69.065934065934 NaN nan nan 8.3 nan 7.7 3.4 1.2 NaN nan nan 19.51951938 nan 13.12055621 NaN nan 0.576954730149871 0.975691793249282 8.3 4.9 15.7 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 15.0998123109611 nan nan nan nan 7.85463074584621 13.8937313057223 nan 3.0 16.2 1998-12-31 00:00:00
5 nan nan nan nan nan nan nan nan nan nan nan 53.6 22.9 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 2.71186440677966 nan nan nan nan 68.4968557711812 NaN nan NaN 7.7 nan 7 5.4 1.8 nan NaN nan 21.14427859 nan 12.28404701 nan NaN 0.586462449836746 0.961097218813725 9.4 5.0 15.7 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 13.4286102047181 nan 0.190133687749198 0.187631928699867 nan 7.34162041018424 13.0378776421259 nan nan 15.2 1999-12-31 00:00:00
12 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 2.25869066525498 nan nan nan nan 67.919382732473 nan NaN NaN 9.4 nan 6.1 4.8 1.7 NaN nan nan 19.86564299 nan 11.73855902 nan nan 0.566411227251697 1 10.1 5.0 16.4 25.68303400840658 39.93121895299962 5.559656080630483 5.368577686550889 3.859013985972417 nan nan nan nan nan nan 7.5 1606.3 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 11.9287437378153 nan 0.204937568001657 0.183487902181234 nan 6.9796642661002 14.5175130463186 nan nan 14.0 2000-12-31 00:00:00
36 5.42 8.32 1.34 3.91 2.68 1.66 0.62 33.8 nan nan nan 19.2 53.2 20.0 13.0 18.0 25.0 8.0 8.0 3.0 7.0 nan nan nan 1.55555555555555 3.88888888888888 28.0 37.0 11.0 21.0 30.0 42.0 12.0 nan 3.10981535471331 nan nan nan nan 67.6567223025364 0.737103455373453 nan nan 8.5 3.2040320403204 5.5 3.4 1.3 nan nan NaN 17.34475386 nan 12.22912863 nan nan 0.55267784188686 1.02704117500685 11.0 5.1 16.8 25.97396449704142 39.08165680473373 5.551530319919428 4.960433429642005 1.794410239487352 nan nan nan nan nan nan 7.6 1693.8 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 12.8634136139006 nan 0.224149021020402 0.18578742892428 nan 6.84050737129184 13.656238968722 0.54002586391768 4.0 13.1 2001-12-31 00:00:00
30 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 2.74933370739234 nan nan nan nan 67.4682395644283 NaN 0.0359384490787608 0.358194978740635 9.4 3.30073667145237 5.3 4.0 1.7 nan NaN nan 14.76293119 nan 13.62982548 nan nan 0.547495509288571 1.03656972147256 11.9 5.0 17.3 24.71911832236339 37.98561151079137 5.235648764126072 4.400569236548513 0.8853161140067168 nan nan nan nan nan nan 7.9 1834.4 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 11.6259117209399 nan 0.219123304312613 0.176452723223324 nan 6.91683539177971 16.7138114327535 0.384999957053682 nan 12.3 2002-12-31 00:00:00
14 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 7.42680633897394 0.30474565675 0.373271672749999 0.673379796 0.76148819875 67.2781705841136 nan 0.0226183678029644 0.337537552449288 9.1 3.62093417328687 4.8 3.7 1.2 NaN nan nan 13.60946763 nan 13.2223601 NaN nan 0.554726914369521 1.02331273604486 12.5 5.2 17.6 24.09006465036784 39.2851304153972 4.984461828692988 3.370603066499685 0.4781744069771662 nan nan nan nan nan nan 7.7 1852.8 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 12.3596635005989 nan 0.224867790550438 0.176378578344961 0.132507386349529 6.80162375549321 11.423434430728 0.644498818603519 nan 12.5 2003-12-31 00:00:00
3 5.51 9.15 1.31 4.17 2.74 1.62 0.59 33.4 55.0 41.0 20.0 19.0 54.6 20.0 14.0 17.0 22.0 9.0 9.0 5.0 6.0 nan nan nan 1.58823529411764 3.23529411764705 23.0 31.0 11.0 19.0 28.0 45.0 12.0 nan 3.2520325203252 0.28247260575 0.440848749 0.6587368645 0.783494614749999 67.0872634143294 nan 0.032186269474405 0.367940910313791 10.3 4.02912516350947 4 2.5 nan 3.3 nan nan 11.65048555 10.8 13.16200699 NaN nan 0.55843653911839 1.01630054518792 13.1 5.2 18.1 23.50086655112652 37.45667244367418 5.143210079430139 3.051085776072352 0.5410778670206408 nan nan nan nan nan nan 7.9 1985.5 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 11.740197271122 0.1 0.243818328564091 0.202435982096999 0.171839459975053 6.18844620539535 11.4988929257342 0.401650925416796 0.0 12.7 2004-12-31 00:00:00
34 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 2.432331295996 0.26788212875 0.496710366 0.649182601 0.800634981749999 66.898247771288 nan 0.0268133462282398 0.345454545454545 11.6 4.48167866631891 3.8 2.1 1.5 2.8 15.6518062083016 nan 9.84308144 11.1 12.43929872 nan nan 0.560911148063158 1.01453312786998 13.5 5.1 17.7 23.29185952694382 38.68104802751709 5.134079474249958 3.075128288143941 0.6006994713287425 nan nan nan nan nan nan 8.3 2123.6 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 12.1861816669171 0.000784090909090909 0.220357833655706 0.21948742746615 0.184163442940038 6.26428916827853 14.7549350420031 0.388228129128096 nan 14.0 2005-12-31 00:00:00
27 nan nan nan nan nan nan nan nan 57.0 43.0 21.0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 11.9236 1.92239650902229 0.26140010425 0.54307055475 0.64662837475 0.816000718 66.6141159349987 0.795522423721119 0.0262585716675029 0.338374787948295 12.3 4.69563924509521 3.9 1.9 1.7 2.8 14.9316199062892 nan 7.649513333 10.7 14.61223605 17.9 nan 0.575006286281902 0.989086530774958 13.3 6.1 20.1 23.57814972104255 36.55591639603631 5.160616481113521 3.331011556808302 0.5746032019552264 64.2857142857143 33.92857142857143 7.142857142857142 152.7331189710611 92.92604501607717 9.003215434083602 8.6 2396.5 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 12.2488000823393 0.000884524406852555 0.266002437101283 0.220916063364633 0.17962870045158 6.35789549136262 11.7102175232651 0.469300100802692 nan 12.1 2006-12-31 00:00:00
23 5.31 8.1 1.2 4.13 2.59 1.61 0.62 32.0 nan nan nan 20.2 59.1 nan nan nan nan nan nan nan nan nan 2.0 3.16666666666666 2.23076923076923 3.53846153846153 nan nan nan nan nan 32.0 13.0 0.119993688 2.8698553948832 0.26473952575 0.62684495675 0.6438993085 0.84298999275 66.3118871083472 nan 0.0247740286782452 0.325517012919407 13.1 5.00007544827219 3.6 1.2 nan 3.1 14.74332887088 3.25044918060302 5.96797682899999 10.9 13.44092008 nan nan 0.576097367526449 0.987016986408988 13.1 5.4 17.2 23.963735859844 36.08036917548087 5.204035084141148 3.487518095478457 0.600718653487315 63.42141863699582 37.69123783031988 7.440890125173842 159.6153846153846 105.1282051282051 7.692307692307693 8.3 2440.0 10.4 89.6 6.6 12.7 36.2 nan 111.3 43.0 7.4 1.8 5.1 0.2 97.6 nan nan nan nan 67.7 11.0436982895819 0.00125289858501727 0.274998816904074 0.226018645591784 0.188396195163503 6.18311958733614 12.0750642819598 0.587194516336476 2.0 11.9 2007-12-31 00:00:00
17 5.34 8.53 1.28 4.01 2.57 1.61 0.62 33.3 55.0 41.0 19.0 19.8 59.0 19.0 13.0 13.0 18.0 9.0 14.0 7.0 7.0 14.0 nan nan nan nan 17.0 24.0 10.0 18.0 25.0 nan nan 0.176417361 8.92911315603195 0.26500752175 0.615998379 0.63594779925 0.83271986375 66.0128788326342 nan 0.0339981229469732 0.29763022055373 13.8 4.70807624374826 4 2.1 1.1 3.2 14.1537562670152 3.38942289352417 4.326328892 11.1 12.53418077 NaN nan 0.566552622803064 1.00907681704586 12.4 5.1 16.8 24.174053182917 36.36180499597099 5.275847049000189 3.652069417554473 0.4670322292243831 67.34426229508198 32.06557377049181 4.786885245901642 151.1904761904762 92.26190476190477 3.571428571428571 9.1 2718.2 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 11.770168989926 0.00133974659784138 0.269943688409197 0.191459408728296 0.176044110746128 6.11353824495542 11.9720839172658 0.438598158892653 2.0 12.5 2008-12-31 00:00:00
15 5.34 8.53 1.28 3.98 2.55 1.58 0.63 32.9 nan nan nan 20.0 57.6 19.0 14.0 13.0 18.0 9.0 14.0 6.0 5.0 11.0 1.76470588235294 3.05882352941176 2.16666666666666 3.58333333333333 18.0 25.0 11.0 20.0 27.0 33.0 15.0 0.197668338 24.0755627009646 0.27398754475 0.547441367249999 0.645163383 0.806098038 65.7087560846871 nan 0.0409533322492973 0.292805872377987 10.8 4.65229068421294 5.8 2.9 1.4 4.3 13.4355539765084 5.2327537536621 6.413301781 14.1 12.87324359 24 nan 0.576229724523685 0.995758226068067 12.4 6.0 17.9 23.59515798429826 35.76594345825115 5.277309543396074 3.72571715416637 0.5642720403933268 62.5 25.875 6.25 135.8662613981763 103.951367781155 11.24620060790274 9.7 2954.5 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 11.329095033648 0.00226323491834885 0.288856884015888 0.202234662826081 0.1886454969918 6.31310831842783 15.8047279378781 0.541976273612743 1.0 11.5 2009-12-31 00:00:00
28 5.33 8.62 1.25 3.93 2.56 1.56 0.61 32.4 54.0 39.0 18.0 20.1 60.0999999999999 20.0 14.0 12.0 17.0 9.0 13.0 6.0 5.0 10.0 1.66666666666666 3.38888888888888 2.16666666666666 3.58333333333333 17.0 25.0 12.0 20.0 28.0 29.0 15.0 0.17039676 10.7336811841124 0.282854103 0.56577382025 0.6630272715 0.819475144499999 65.4045250974233 nan 0.0513439007580978 0.296577073282793 10.1 4.77659693016309 6.1 3.0 1.5 3.9 13.7687970059716 5.60113716125488 8.879184954 13.4 12.72185373 NaN nan 0.556224675163015 1.03122755980926 12.2 6.0 18.3 24.4140625 35.73069852941176 5.462299069965385 3.920145171817537 0.6167547991979276 63.78881987577639 28.07453416149068 5.590062111801233 146.9512195121951 104.2682926829268 4.573170731707317 9.7 3006.4 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 11.8071897448967 0.00304249942568343 0.290902825637491 0.206753962784286 0.199586492074431 5.98630829313117 9.65370771331246 0.561874204800346 2.0 10.8 2010-12-31 00:00:00
2 5.94 9.75 1.44 4.26 2.67 1.64 0.62 35.1 nan nan nan 18.9 57.5 20.0 14.0 13.0 18.0 9.0 11.0 5.0 5.0 9.0 1.94736842105263 3.21052631578947 2.14285714285714 3.57142857142857 18.0 25.0 12.0 20.0 28.0 25.0 13.0 0.140193975 7.11753731343283 0.286207073 0.522625904 0.672104344 0.80877855875 65.1031136271734 nan 0.0561977064952237 0.252650297517269 9 4.70306362317849 6 2.9 1.7 4 13.2852367640578 5.59156560897827 8.914450139 13.2 13.69347462 nan nan 0.556565360882348 1.03070274869252 11.0 5.7 16.6 26.76747526332588 40.40057452920524 5.625365837737466 4.116263815877027 0.6335082681272718 60.90584028605481 25.14898688915375 4.290822407628122 146.4071856287425 98.20359281437125 9.281437125748502 9.6 3106.5 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 10.9000006714222 0.00275630941796046 0.284385472949866 0.199165583749401 0.213596881198276 5.83858833185144 8.43968476636899 0.445194165465418 1.0 10.3 2011-12-31 00:00:00
8 5.28 8.23 1.23 4.01 2.61 1.65 0.63 32.4 nan nan nan 20.1 58.7 19.0 14.0 13.0 18.0 10.0 10.0 6.0 5.0 9.0 1.64705882352941 3.52941176470588 1.91666666666666 4.33333333333333 19.0 25.0 13.0 20.0 27.0 28.0 15.0 0.118259932 6.75171736997056 0.28689237275 0.48016812525 0.662568357 0.7887420035 64.8056861171615 nan 0.0589248872061124 0.170154397261205 8.7 4.8046064864858 6.4 3.9 1.6 4.2 13.2635662758451 5.8853006362915 13.29677878 13.5 13.55530034 23.9 nan 0.553735337412133 1.03746379202201 10.8 5.4 16.9 26.99841594869219 41.93099717961596 5.739509808579469 3.995548849960162 0.5089958053256108 59.09610983981693 20.13729977116704 1.716247139588105 143.3526011560694 92.48554913294798 0.2890173410404624 9.7 3156.8 14.3 89.3 4.5 16.3 34.3 6.7 13.6 44.5 8.7 3.3 5.5 0.3 97.8 6.6 4.5 4.7 55.5 64.5 12.3 0.00268596821366223 0.272065658512254 0.183304237422631 0.194254880177749 5.24748906069331 10.2089661947099 0.493306941151519 1.0 9.1 2012-12-31 00:00:00
7 5.38 8.29 1.3 4.02 2.62 1.64 0.63 33.4 nan nan nan 19.7 58.3 18.0 14.0 12.0 17.0 10.0 10.0 5.0 4.0 9.0 1.875 3.5 2.0 4.0 18.0 23.0 13.0 19.0 24.0 35.0 12.0 0.100523499 3.52633477633477 0.291834894 0.50975419275 0.66128961525 0.7931701685 64.4122657580919 0.914340971956731 0.116648300987249 0.180403445251534 8.5 5.03792323243437 5.8 3.6 1.6 4.1 14.0851526460705 5.33285951614379 12.17257318 11.9 14.08030591 nan 1.79347826086956 0.563012705025085 1.02123083663276 10.6 5.7 17.9 27.36466607825831 42.8471609296852 5.834845910325214 3.934648826896657 0.4902376499838548 59.83333333333333 29.16666666666667 3.499999999999995 162.3931623931624 94.01709401709401 9.971509971509972 9.4 3346.4 14.5 89.5 6.1 16.4 33.7 7.2 14.5 43.3 8.7 4.4 5.8 0.5 98.2 4.5 4.3 6.5 54.9 63.4 11.0 0.00279713045629343 0.264803112420446 0.181820226235185 0.194773652371421 5.23059797152944 10.8063568394107 0.512603102046059 2.0 11.2 2013-12-31 00:00:00
29 5.8 9.59 1.36 nan 2.78 1.66 0.6 34.2 nan nan nan 19.0 56.2 19.0 14.0 12.0 16.0 nan 12.0 6.0 4.0 8.0 2.05882352941176 3.47058823529411 1.69230769230769 3.92307692307692 17.0 23.0 nan 20.0 29.0 nan nan 0.100758708 2.78443361150128 0.29141248675 0.534556459999999 0.6453577715 0.7959981185 64.0002245424946 nan 0.189426336664377 0.176375440384658 9.4 5.30111255886082 5.4 2.9 1.6 4.1 14.0320892397174 4.76441287994384 13.63636376 11.4 13.86467017 nan 2.04654255319149 0.54257237321901 1.05957281112658 10.2 5.2 17.3 27.84883300816533 44.64195389840307 5.959064936999185 3.702775588847961 0.4748641862326134 66.66666666666666 31.11111111111112 2.777777777777778 161.4325068870523 110.7438016528926 5.785123966942149 9.4 3453.3 15.5 91.4 6.2 18.6 33.4 7.0 14.0 41.5 8.4 4 5.4 0.7 98.4 3.9 3.9 5.3 53.4 62.6 11.73 0.00279785513283552 0.239037468702222 0.163723383040482 0.190777957501495 5.17941104783851 9.53559437799096 0.570174725178598 2.0 9.9 2014-12-31 00:00:00
13 5.88 9.82 1.45 4.01 2.61 1.61 0.62 35.0 59.0 45.0 22.0 19.1 57.6 20.0 15.0 12.0 16.0 10.0 13.0 7.0 4.0 8.0 2.3125 3.0 2.81818181818181 3.18181818181818 16.0 22.0 12.0 20.0 28.0 34.0 13.0 0.10844285 2.86082681924409 0.2846640175 0.508278942499999 0.62552256425 0.77981667375 63.5854497061106 NaN 0.18291913986911 0.165807840323527 8.8 5.50522662693445 5.4 3.1 1.5 3.9 13.6141930131136 4.68318223953247 13.20604627 11.3 13.89062945 24.9 2.45194805194805 0.556715416157437 1.03172792356201 9.8 5.3 17.8 27.99002738956388 45.18926890933352 6.057549384693136 3.692838567263505 0.4592918261953701 63.13513513513514 29.72972972972973 2.702702702702703 157.1045576407507 103.7533512064343 7.238605898123325 9.3 3530.1 14.6 88.9 6.2 17.4 33.1 5.8 13.7 40.5 8.6 4.0 6.1 1.1 98 5.2 3.3 6.1 54.8 63.1 12.27 0.00277020416150283 0.248255169264888 0.165807840323527 0.187463309634074 5.53688604787685 10.4434552171586 0.451131307605537 1.0 11.8 2015-12-31 00:00:00
32 5.61 9.1 1.36 4.14 2.59 1.62 0.63 33.8 nan nan nan 19.6 57.3 20.0 14.0 11.0 15.0 10.0 14.0 7.0 5.0 8.0 2.14285714285714 3.92857142857142 2.45454545454545 3.72727272727272 14.0 20.0 13.0 19.0 25.0 nan nan 0.096088133 3.77506881635863 0.2839379445 0.534711984749999 0.607017201499999 0.78210112725 63.1780402449693 nan 0.169722133503673 0.12360268284893 8.2 5.79740687238183 5.1 3.0 1.2 4.3 15.0208548992001 4.4 14.14554371 12 11.18821004 nan 2.23358585858585 0.556034802420428 1.03252390867511 9.4 5.2 17.8 27.80509611086276 nan 6.2 3.6 0.4 55.3213367609255 28.43187660668381 2.827763496143962 160.9756097560976 111.6531165311653 16.53116531165312 nan nan 15.4 87.8 6.8 18.8 32 6.9 14.3 40.1 10.3 4.3 5.8 0.3 97.7 3.8 4 4.5 48.5 63.9 12.33 0.0026445225167678 0.281635260300223 0.17943149153625 0.200894282976684 5.67267113808155 10.6544382127819 0.321063824979607 2.0 12.0 2016-12-31 00:00:00
6 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1.04948674735713 0.286440483 0.574313678 0.607096032 0.798594578 nan nan 0.252376980817347 0.125333611342785 8.3 6.10482289408685 4.7 3.1 1.2 4.4 nan nan 15.64339767 11.8 12.09159648 NaN nan NaN nan NaN nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 16.7 88.2 7.6 19.9 31.9 6.6 14.3 38.8 9.3 4.9 5.6 0.5 98.1 3.9 2.6 3 49.8 62.5 12.64 0.00242055879899916 0.307631359466221 0.191909924937447 0.212364470391993 nan 7.3008688033876 nan nan nan 2017-12-31 00:00:00
35 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan NaN nan nan 6.23974766855127 nan nan nan nan nan nan nan NaN nan nan nan nan NaN NaN nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 2018-12-31 00:00:00

2. Check the data types and adjust them¶

In [3]:
print(df.dtypes)
Q5:Q1                                             object
D10:D1                                            object
D10:D1-4(Palma)                                   object
P90:P10_bhc                                       object
P80:P20_bhc                                       object
                                                   ...  
recorded_murders_and_homicides_per_million        object
regional_gdp_proportional_variation               object
difference_in_percent_for_low_income_by_gender    object
gender_pay_gap                                    object
year                                              object
Length: 103, dtype: object
In [4]:
numericcolumns = df.columns[:-1]  
df[numericcolumns] = df[numericcolumns].apply(pd.to_numeric,errors='coerce')  
In [5]:
print(df.dtypes)
Q5:Q1                                             float64
D10:D1                                            float64
D10:D1-4(Palma)                                   float64
P90:P10_bhc                                       float64
P80:P20_bhc                                       float64
                                                   ...   
recorded_murders_and_homicides_per_million        float64
regional_gdp_proportional_variation               float64
difference_in_percent_for_low_income_by_gender    float64
gender_pay_gap                                    float64
year                                               object
Length: 103, dtype: object

3. More wrangling will be done in EDA/Data Visualisation¶

EDA/Data Visualisation¶

Print all the column names for easily copy later¶

In [6]:
column_names = df.columns.tolist()
print("All column names:", column_names)
All column names: ['Q5:Q1', 'D10:D1', 'D10:D1-4(Palma)', 'P90:P10_bhc', 'P80:P20_bhc', 'P80:P50_bhc', 'P50:P20_bhc', 'GINI-BHC', 'top_10_perc_wealth_share', 'top_5_perc_wealth_share', 'top_1_perc_wealth_share', 'UN_middle_class_def_inc_share', 'OECD_middle_class_def_inc_share', 'LIH_B60_contemp_median_AHC', 'LIH_B50_contemp_median_AHC', 'LIH_B50_CV_AHC', 'LIH_B60_CV_AHC', 'LIH_B40_contemp_median_AHC', 'LIEH_B60_contemp_median_AHC', 'LIEH_B50_contemp_median_AHC', 'LIEH_B50_CV_median_AHC', 'LIEH_B60_CV_median_AHC', 'LIS65H_B60_CV_median_AHC', 'LISPH_B60_CV_median_AHC', 'LIS65H_B50_CV_median_AHC', 'LISPH_B50_CV_median_AHC', 'child_poverty_ahc_perc_HH_B50_CV', 'child_poverty_ahc_perc_HH_B60_CV', 'child_poverty_ahc_perc_HH_B40_contemp_median', 'child_poverty_ahc_perc_HH_B50_contemp_median', 'child_poverty_ahc_perc_HH_B60_contemp_median', 'child_poverty_ahc_perc_HH_B60_CV_one_or_more_FT_working_parents', 'child_poverty_ahc_perc_HH_B60_CV_one_or_more_PT_working_parents', 'rate_of_personal_insolvencies', 'loan_delinquency_percent', 'national_house_rental_A30%_income_spend', 'national_house_purchasing_A30%_income_spend', 'national_house_rental_HAM', 'national_house_purchasing_HAM', 'home_ownership_rate', 'homelessness_percentage_of_population', 'priority_A_state_housing_applicants_as_pop_percent', 'priority_B_state_housing_applicants_as_pop_percent', 'debt_servicing_as_percent_of_disp_inc', 'house_median_multiples', 'unemployment_rate', '60_64_years_unemp_rate', '65years_and_over_unemp_rate', 'underemployment_rate', 'employees_working_long_hours', 'labour_market_insecurity', 'long_term_unemployment_rate', 'pop_15_to_29_NEET', 'low_pay_percent_OECD_def', 'low_pay_percent_120percent_MW_def', 'living_min_wage_gap_magnitude', 'LIS_ms-16', 'LP:RPW_ms-16', 'teriary_education_participation_rate', 'education_spend_as_percent_of_GDP', 'education_spend_as_percent_of_total_gov_expenses', 'median_teriary_loan_as_perc_of_med_inc_AHC', 'median_teriary_loan_balance_as_perc_of_med_inc_AHC', 'university_affordability_ratio_to_mean_salary', 'polytechnics_affordability_ratio_to_mean_salary', 'wananga_affordability_ratio_to_mean_salary', 'bachelors_earning_premium_hourly', 'dips_cert_earning_premium_hourly', 'school_earning_premium_hourly', 'bachelors_earning_premium_weekly', 'dips_cert_earning_premium_weekly', 'school_earning_premium_weekly', 'health_expenditure_as_percent_of_gdp', 'health_expenditure_per_capita_PPP', 'depression_adult', 'excellent_very_good_or_good_self_rated_health', 'psychological_distress_adult', 'mood_anxiety_disorder_adult', 'healthy_weight_adult', 'unmet_after_hours_care_due_to_cost_adult', 'unmet_GP_care_due_to_cost_adult', 'veg_and_fruit_intake_adult', 'breakfast_eaten_at_home_less_5_days_a_week_child', 'emotional_behavioural_problems_child', 'adult_diabetes_rates', 'depression_child', 'excellent_very_good_or_good_parent_rated_health_child', 'unfilled_prescription_due_to_cost_child', 'unmet_need_for_after_hours_due_to_cost_child', 'unmet_need_for_GP_due_to_cost_child', 'vegetable_and_fruit_intake_child', 'healthy_weight_child', 'suicides_per_100000', 'problem_gambling_intervention_prevelance_percent', 'total_prisoners_in_remand_rate', 'total_sentenced_prisoners_rate', 'total_post_sentence_offender_population_rate', 'violent_crime_victimisations_rate', 'recorded_murders_and_homicides_per_million', 'regional_gdp_proportional_variation', 'difference_in_percent_for_low_income_by_gender', 'gender_pay_gap', 'year']

1. Income and wealth¶

Primary plotting to know curves and to look for erroneous values¶

In [7]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['Q5:Q1', 'D10:D1', 'D10:D1-4(Palma)', 'P90:P10_bhc', 'P80:P20_bhc', 'P80:P50_bhc', 'P50:P20_bhc', 'GINI-BHC', 'top_10_perc_wealth_share', 'top_5_perc_wealth_share', 'top_1_perc_wealth_share', 'UN_middle_class_def_inc_share', 'OECD_middle_class_def_inc_share']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Income and wealth',fontsize=22)
plt.legend()
plt.grid(True)
plt.show()

Restore the integrity of the data: some numbers are obvious wrong¶

In [8]:
# from the chart we see four obvious wrong numbers, so will change them to NaN
df.loc[33, 'Q5:Q1'] = np.nan
df.loc[20, 'GINI-BHC'] = np.nan
df.loc[5, 'UN_middle_class_def_inc_share']= np.nan
df.loc[5, 'OECD_middle_class_def_inc_share'] = np.nan
#[5, 'UN_middle_class_def_inc_share'] and [5, 'OECD_middle_class_def_inc_share'] wrong numbers could be opposite as they are in same row and meets each ohter level.

New plotting¶

In [9]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['Q5:Q1', 'D10:D1', 'D10:D1-4(Palma)', 'P90:P10_bhc', 'P80:P20_bhc', 'P80:P50_bhc', 'P50:P20_bhc', 'GINI-BHC', 'top_10_perc_wealth_share', 'top_5_perc_wealth_share', 'top_1_perc_wealth_share', 'UN_middle_class_def_inc_share', 'OECD_middle_class_def_inc_share']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Income and wealth',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

Group 1: from decile side¶

In [10]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['Q5:Q1', 'D10:D1', 'D10:D1-4(Palma)']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('from decile side',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> The three aviables all show the difference of richest group to poorest group. The whole trend shows that the gap is becoming bigger.

-> Decreased gap at the end of 2007 and at the end of 2012, this maybe related to financial crises, which had higher influences on richest group.

-> Decreased gap at the end of 2016 is strange without finding the reason.

Group 2: data inside top 10%¶

In [11]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = [ 'top_10_perc_wealth_share', 'top_5_perc_wealth_share', 'top_1_perc_wealth_share']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('data inside top 10%',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> not so many data, but similar trends of the three variables. not so big changes from 2005 to 2016;

-> the three variables is related to 'Q5:Q1', 'D10:D1', 'D10:D1-4(Palma)', further study the wealth distribution in the top 10%;

-> although not so many data, it is still suprised to see that the top 10% occupy more than half the wealth! If we compared to the income data"'D10:D1'" which is around 9, we can see that the richest 10% accumate much more wealth if we compare to the income difference. As the poor group need to use most of their income for daily life, without not so much saving or wealth.

Group 3: from the percentage side¶

In [12]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = [ 'P90:P10_bhc', 'P80:P20_bhc', 'P80:P50_bhc', 'P50:P20_bhc']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title(' from the percentage side ',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> Similar trends as above

-> Seems that 'P50:P20_bhc' should be bigger than 1, current result could be P20:P50, so let's try to reverse it. And we can double check that: 'P50:P20_bhc' = 'P80:P20_bhc'/'P80:P50_bhc'. General calcualtion shows that our estimation is right. Both ways have the same result.

In [13]:
df['P50:P20_bhc'] = df['P50:P20_bhc'].apply(lambda x: 1/x if pd.notnull(x) and x != 0 else x)
In [14]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = [ 'P90:P10_bhc', 'P80:P20_bhc', 'P80:P50_bhc', 'P50:P20_bhc']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title(' from the percentage side ',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [15]:
df_percentage = df[['P90:P10_bhc', 'P80:P20_bhc', 'P80:P50_bhc', 'P50:P20_bhc']].transpose() 
In [16]:
fig, ax = plt.subplots()
ax.plot(df_percentage.index, df_percentage.values)
Out[16]:
[<matplotlib.lines.Line2D at 0x223ec7e8110>,
 <matplotlib.lines.Line2D at 0x223ec82f350>,
 <matplotlib.lines.Line2D at 0x223ec7ea090>,
 <matplotlib.lines.Line2D at 0x223ec7e91d0>,
 <matplotlib.lines.Line2D at 0x223ec7ea450>,
 <matplotlib.lines.Line2D at 0x223ec7e8b10>,
 <matplotlib.lines.Line2D at 0x223ec7e8310>,
 <matplotlib.lines.Line2D at 0x223ec7eb590>,
 <matplotlib.lines.Line2D at 0x223ec7eb650>,
 <matplotlib.lines.Line2D at 0x223ec7eacd0>,
 <matplotlib.lines.Line2D at 0x223ec7ebd50>,
 <matplotlib.lines.Line2D at 0x223ec7ea550>,
 <matplotlib.lines.Line2D at 0x223ec7eae50>,
 <matplotlib.lines.Line2D at 0x223ec7ea9d0>,
 <matplotlib.lines.Line2D at 0x223ec7e8fd0>,
 <matplotlib.lines.Line2D at 0x223ec7f3e50>,
 <matplotlib.lines.Line2D at 0x223ec7d6210>,
 <matplotlib.lines.Line2D at 0x223ec7f0690>,
 <matplotlib.lines.Line2D at 0x223ec7f1350>,
 <matplotlib.lines.Line2D at 0x223ec7f02d0>,
 <matplotlib.lines.Line2D at 0x223ec7f04d0>,
 <matplotlib.lines.Line2D at 0x223ec7f1c90>,
 <matplotlib.lines.Line2D at 0x223ec7eaed0>,
 <matplotlib.lines.Line2D at 0x223ec833750>,
 <matplotlib.lines.Line2D at 0x223ec831a10>,
 <matplotlib.lines.Line2D at 0x223ec7f15d0>,
 <matplotlib.lines.Line2D at 0x223ec7f0110>,
 <matplotlib.lines.Line2D at 0x223ec7f3ad0>,
 <matplotlib.lines.Line2D at 0x223ec7f2e50>,
 <matplotlib.lines.Line2D at 0x223ec16dd90>,
 <matplotlib.lines.Line2D at 0x223ec7ffed0>,
 <matplotlib.lines.Line2D at 0x223ec82d0d0>,
 <matplotlib.lines.Line2D at 0x223ec7fd250>,
 <matplotlib.lines.Line2D at 0x223ec7fe390>,
 <matplotlib.lines.Line2D at 0x223ec7fe1d0>,
 <matplotlib.lines.Line2D at 0x223ec7ff850>,
 <matplotlib.lines.Line2D at 0x223ec7ff790>]
In [17]:
df[['P90:P10_bhc', 'P80:P20_bhc', 'P80:P50_bhc', 'P50:P20_bhc']].plot(kind='box')
Out[17]:
<Axes: >

-> P90:P10 box is bigger, show the bigger changes. Combined with curve, we see that distance between the richest and poorest is even bigger year after year. This also influences the GINI level.

Group 4: GINI¶

In [18]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = [ 'GINI-BHC']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('GINI',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> Interpretation of the Gini coefficient can vary depending on the context and the country. However, in general: A Gini coefficient below 0.3 is considered low inequality. A Gini coefficient between 0.3 and 0.4 is considered moderate inequality. A Gini coefficient between 0.4 and 0.6 is considered high inequality. A Gini coefficient above 0.6 is considered severe inequality.

-> big increase from 1990s and relative big variation from 2008 to 2016; although it is still under 0.4, but the inequality is increasing.

Group 5: UN middle class and OECD middle class¶

In [19]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = [ 'UN_middle_class_def_inc_share', 'OECD_middle_class_def_inc_share']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('UN middle class and OECD middle class',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> UN_middle_class_def_inc_share:¶

Percent share of all income earned by lower earners as defined by the UN (Deciles 1-4). Actually this indicates the lowest income group (Deciles 1-4). It decreased in 1990s and then is relatively stable.

-> OECD_middle_class_def_inc_share:¶

Percent Share of all income earned by the Middle Class as defined by OECD (75%-200% of median). This can indicate the middle class. It has similar trend as the "UN_middle_class_def_inc_share" and also decreased in 1990s, but then increases again, which is different from "UN_middle_class_def_inc_share". So the middle class's situation improves compared to the group (Deciles 1-4)

-> we can see some corelation of the two variables with 'GINI-BHC'. More deeper analysis can be done in next section.

2. Poverty¶

Primary plotting to know the general trends and to look for erroneous values¶

In [20]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['LIH_B60_contemp_median_AHC', 'LIH_B50_contemp_median_AHC', 'LIH_B50_CV_AHC', 'LIH_B60_CV_AHC', 'LIH_B40_contemp_median_AHC', 'LIEH_B60_contemp_median_AHC', 'LIEH_B50_contemp_median_AHC', 'LIEH_B50_CV_median_AHC', 'LIEH_B60_CV_median_AHC', 'LIS65H_B60_CV_median_AHC', 'LISPH_B60_CV_median_AHC', 'LIS65H_B50_CV_median_AHC', 'LISPH_B50_CV_median_AHC', 'child_poverty_ahc_perc_HH_B50_CV', 'child_poverty_ahc_perc_HH_B60_CV', 'child_poverty_ahc_perc_HH_B40_contemp_median', 'child_poverty_ahc_perc_HH_B50_contemp_median', 'child_poverty_ahc_perc_HH_B60_contemp_median', 'child_poverty_ahc_perc_HH_B60_CV_one_or_more_FT_working_parents', 'child_poverty_ahc_perc_HH_B60_CV_one_or_more_PT_working_parents', 'rate_of_personal_insolvencies', 'loan_delinquency_percent']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Poverty',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> Only see one sudden increase of "OECD_middle_class_def_inc_share" in beginning of 2009, maybe related to financial crisis in 2008;

Group 1: percentage of households earning¶

In [21]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['LIH_B40_contemp_median_AHC', 'LIH_B50_contemp_median_AHC', 'LIH_B60_contemp_median_AHC', 'LIH_B50_CV_AHC', 'LIH_B60_CV_AHC']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('percentage of households earning',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> 'LIH_B50_CV_AHC', 'LIH_B60_CV_AHC' decease after 2007, while 'LIH_B40_contemp_median_AHC', 'LIH_B50_contemp_median_AHC', 'LIH_B60_contemp_median_AHC' vibrate up and down. The reason is that the income after housing costs increases;

Group 2: the ederly situation¶

In [22]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['LIEH_B50_contemp_median_AHC', 'LIEH_B60_contemp_median_AHC', 'LIEH_B50_CV_median_AHC', 'LIEH_B60_CV_median_AHC', 'LIH_B50_contemp_median_AHC', 'LIH_B60_contemp_median_AHC', 'LIH_B50_CV_AHC', 'LIH_B60_CV_AHC']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('the ederly situation',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> the elderly situation is simliar, but seems that their situation is better than the overall. Bigger decrease compared to 2007, that means that the elderly's income after housing cost has bigger increase compared to overall after 2007. But obviously not so good after 2014;

Group 3: earning of families with children¶

In [23]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = [ 'child_poverty_ahc_perc_HH_B40_contemp_median', 'child_poverty_ahc_perc_HH_B50_contemp_median', 'child_poverty_ahc_perc_HH_B60_contemp_median','child_poverty_ahc_perc_HH_B50_CV', 'child_poverty_ahc_perc_HH_B60_CV',  'child_poverty_ahc_perc_HH_B60_CV_one_or_more_FT_working_parents', 'child_poverty_ahc_perc_HH_B60_CV_one_or_more_PT_working_parents']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('earning of families with children',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

Group 4: personal insolvencies and loan delinquency¶

In [24]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['rate_of_personal_insolvencies', 'loan_delinquency_percent']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('personal insolvencies and loan delinquency',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> For 'rate_of_personal_insolvencies', the number at the end of 2006 is not right. 11.9236 could be 0.119236. But just take it as none.

-> Sudden big increase of 'loan_delinquency_percent' at end of 2009, could also be the subsequent results of financial crisis in 2008;

In [25]:
df.loc[27, 'rate_of_personal_insolvencies'] = np.nan

3. Housing¶

In [26]:
df['national_house_rental_A30%_income_spend'] = df['national_house_rental_A30%_income_spend'].apply(lambda x: x * 100 if x is not None else x)
df['national_house_purchasing_A30%_income_spend'] = df[ 'national_house_purchasing_A30%_income_spend'].apply(lambda x: x * 100 if x is not None else x)
df['priority_A_state_housing_applicants_as_pop_percent'] = df['priority_A_state_housing_applicants_as_pop_percent'].apply(lambda x: x * 100 if x is not None else x)
df['priority_B_state_housing_applicants_as_pop_percent'] = df[ 'priority_B_state_housing_applicants_as_pop_percent'].apply(lambda x: x * 100 if x is not None else x)
In [27]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['national_house_rental_A30%_income_spend', 'national_house_purchasing_A30%_income_spend', 'national_house_rental_HAM', 'national_house_purchasing_HAM', 'home_ownership_rate', 'homelessness_percentage_of_population', 'priority_A_state_housing_applicants_as_pop_percent', 'priority_B_state_housing_applicants_as_pop_percent', 'debt_servicing_as_percent_of_disp_inc', 'house_median_multiples']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Housing',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

Group 1: from house purchasing side¶

In [28]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['national_house_purchasing_A30%_income_spend', 'national_house_purchasing_HAM', 'home_ownership_rate', 'debt_servicing_as_percent_of_disp_inc', 'house_median_multiples']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)## Group 1 from purchasing side
plt.ylabel('Value',fontsize=16)
plt.title('from house purchasing side',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> 'home_ownership_rate' decreases from 1992 to 2016, which is corelated to the 'house_median_multiples' which always increases. This means that the housing price increase is bigger than the income increase, so that less people will afford a housing. It is also somehow corelated to the 'national_house_purchasing_A30%_income_spend' which has overall increase trend.

-> 'national_house_purchasing_A30%_income_spend' is very corelated with 'debt_servicing_as_percent_of_disp_inc'. The higher percentage of households that spend above 30% of income on housing costs, the higher percentage they have debt servicing of nominal disposable income.

-> 'national_house_purchasing_HAM' is relative stable. That indicates that the first buyer's decision based on finanical status has nearly no big change.

Group 2: for renting and house applicants¶

In [29]:
df['priority_AandB']=df['priority_A_state_housing_applicants_as_pop_percent']+df['priority_B_state_housing_applicants_as_pop_percent']
C:\Users\lpxue\AppData\Local\Temp\ipykernel_13584\2667346656.py:1: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  df['priority_AandB']=df['priority_A_state_housing_applicants_as_pop_percent']+df['priority_B_state_housing_applicants_as_pop_percent']
In [30]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['priority_AandB','national_house_rental_A30%_income_spend','national_house_purchasing_A30%_income_spend', 'national_house_rental_HAM', 'priority_A_state_housing_applicants_as_pop_percent', 'priority_B_state_housing_applicants_as_pop_percent']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('for renting and house applicants',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> 'priority_A_state_housing_applicants_as_pop_percent', 'priority_B_state_housing_applicants_as_pop_percent' have opppiste trend, as some priority B has changed into priority A when their situation became worse. Priority A plus Priority B is still stable with some derease in 2013 but then increases again.That means after 2013, more are at risk of housing. This is quite corelated with ,'national_house_purchasing_A30%_income_spend'. When more income is spending on the housing costs, then more Priority A+ Priority B.

Priority A refers to applicants who are considered at risk and includes households with a severe and persistent housing need that must be addressed immediately.

Priority B refers to applicants who have a serious housing need and includes households with a significant and persistent need.

-> 'national_house_rental_A30%_income_spend' and 'national_house_rental_HAM' is relative stable. So that means for the renting part, it is stable.

4. Labour¶

In [31]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['unemployment_rate', '60_64_years_unemp_rate', '65years_and_over_unemp_rate', 'underemployment_rate', 'employees_working_long_hours', 'labour_market_insecurity', 'long_term_unemployment_rate', 'pop_15_to_29_NEET', 'low_pay_percent_OECD_def', 'low_pay_percent_120percent_MW_def', 'living_min_wage_gap_magnitude', 'LIS_ms-16', 'LP:RPW_ms-16']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Labour',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

Group 1: unemployment¶

In [32]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['unemployment_rate', '60_64_years_unemp_rate', '65years_and_over_unemp_rate', 'underemployment_rate', 'labour_market_insecurity', 'long_term_unemployment_rate', 'pop_15_to_29_NEET']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('unemployment',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

Long-term unemployment refers to people who have been unemployed for 12 months or more. The long-term unemployment rate shows the proportion of these long-term unemployed among all unemployed. Long-term unemployment causes significant mental and material stress for those affected and their families. It is also of particular concern for policy makers, as high rates of long-term unemployment indicate that labour markets are operating inefficiently. This indicator is measured as a percentage of unemployed. "https://data.oecd.org/unemp/long-term-unemployment-rate.htm"

Underemployment is a measure of the total number of people in an economy who are unwillingly working in low-skill and low-paying jobs or only part-time because they cannot get full-time jobs that use their skills. https://www.investopedia.com/terms/u/underemployment.asp#:~:text=The%20U.S.%20unemployment%20rate%20was%2013.3%25%20as%20of,is%20the%20sum%20of%20the%20employed%20and%20unemployed%29.%22

-> from above definition, we see big improvement of unemployment from 1993 to 2009, but from 2009 to 2018 the situation is not so good.

-> '60_64_years_unemp_rate' is corelated with the 'unemployment_rate'. '65years_and_over_unemp_rate' is stable and retirement is normal.

-> 'labour_market_insecurity', "underemployment_rateand" and 'pop_15_to_29_NEET' are very corelated with the 'unemployment_rate'. This is rational.

Group 2: employment¶

In [33]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['employees_working_long_hours', 'low_pay_percent_OECD_def', 'low_pay_percent_120percent_MW_def','living_min_wage_gap_magnitude', 'LIS_ms-16', 'LP:RPW_ms-16']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('employment',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> nearly no big findings from the three avriables.

-> Sudden increase of 'employees_working_long_hours' and decrease of 'low_pay_percent_OECD_def', opposite trend.

-> 'LIS_ms-16', 'LP:RPW_ms-16' are very stable. Stable industry and ratio of labour productivity (indexed) to real product wages (indexed) in MS-16 industries.

5. Education¶

In [34]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['teriary_education_participation_rate', 'education_spend_as_percent_of_GDP', 'education_spend_as_percent_of_total_gov_expenses', 'median_teriary_loan_as_perc_of_med_inc_AHC', 'median_teriary_loan_balance_as_perc_of_med_inc_AHC', 'university_affordability_ratio_to_mean_salary', 'polytechnics_affordability_ratio_to_mean_salary', 'wananga_affordability_ratio_to_mean_salary', 'bachelors_earning_premium_hourly', 'dips_cert_earning_premium_hourly', 'school_earning_premium_hourly', 'bachelors_earning_premium_weekly', 'dips_cert_earning_premium_weekly', 'school_earning_premium_weekly']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Education',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

Group 1: education investment and cost¶

In [35]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['teriary_education_participation_rate', 'education_spend_as_percent_of_GDP', 'education_spend_as_percent_of_total_gov_expenses', 'median_teriary_loan_as_perc_of_med_inc_AHC', 'median_teriary_loan_balance_as_perc_of_med_inc_AHC', 'university_affordability_ratio_to_mean_salary', 'polytechnics_affordability_ratio_to_mean_salary', 'wananga_affordability_ratio_to_mean_salary']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('education investment and cost',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> 'education_spend_as_percent_of_GDP', 'education_spend_as_percent_of_total_gov_expenses' show the government investment on education. Stable percentage of GDP but overall increasing percentage in total government expenses. Shows that the government shows high attention on education.

-> 'university_affordability_ratio_to_mean_salary' is relative stable with small increase. 'polytechnics_affordability_ratio_to_mean_salary' and 'wananga_affordability_ratio_to_mean_salary' decreased bigly from 2001to 2004 and then became relative stable.

-> 'median_teriary_loan_as_perc_of_med_inc_AHC' and 'median_teriary_loan_balance_as_perc_of_med_inc_AHC' has big increase from 2011. Which shows students has higher loaning although the cost is relative stable. So thatn means the family may spend more money on housing, so they have less money on education and have to loan.

-> 'teriary_education_participation_rate' increased before 2006 and decreased after 2006.

Group 2: earnings of different study degrees¶

In [36]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['bachelors_earning_premium_hourly', 'dips_cert_earning_premium_hourly', 'school_earning_premium_hourly', 'bachelors_earning_premium_weekly', 'dips_cert_earning_premium_weekly', 'school_earning_premium_weekly']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('earnings of different study degrees',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> in general it shows the education is quite related to the income/earning.

-> the earning of school qualification is a little higher than the no qualification. The diploma and master's earnings are much higher than the no qualification. So the education is important for earnings.

6. Health¶

Group 1: health expenditure overall¶

-> seprately show the 'health_expenditure_per_capita_PPP', as the number is much bigger than other variables.

In [37]:
selected_columns = ['health_expenditure_per_capita_PPP']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('health_expenditure_per_capita_PPP',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [38]:
selected_columns = ['health_expenditure_as_percent_of_gdp']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('health_expenditure_as_percent_of_gdp',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> from 2001 to 2016, nearly double. Big increase of the expenditure per capita, and big increase of pecentage in GDP, but not doubled.

Group 2: adults health¶

In [39]:
selected_columns = ['depression_adult', 'excellent_very_good_or_good_self_rated_health', 'psychological_distress_adult', 'mood_anxiety_disorder_adult', 'healthy_weight_adult', 'unmet_after_hours_care_due_to_cost_adult', 'unmet_GP_care_due_to_cost_adult', 'veg_and_fruit_intake_adult']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('adult health',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> 'depression_adult', 'psychological_distress_adult' and 'mood_anxiety_disorder_adult' all increased slowly from 2013 to 2018;

-> 'healthy_weight_adult' and 'veg_and_fruit_intake_adult' both decreased from 2013 to 2018. The two avriables are corelated. Vegetable and fruit intake is related to health weight.

-> 'unmet_after_hours_care_due_to_cost_adult' and 'unmet_GP_care_due_to_cost_adult' are relative stable.

-> 'excellent_very_good_or_good_self_rated_health' has high rate. This is strange, as the 'depression_adult' rate is high.

In [40]:
df.loc[23,  'unmet_GP_care_due_to_cost_adult'] = np.nan
In [41]:
selected_columns = ['depression_adult', 'excellent_very_good_or_good_self_rated_health', 'psychological_distress_adult', 'mood_anxiety_disorder_adult', 'healthy_weight_adult', 'unmet_after_hours_care_due_to_cost_adult', 'unmet_GP_care_due_to_cost_adult', 'veg_and_fruit_intake_adult']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('adult health',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

Group 3: children health¶

In [42]:
selected_columns = [ 'vegetable_and_fruit_intake_child', 'healthy_weight_child','excellent_very_good_or_good_parent_rated_health_child']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('children health: vegetable, weight, health',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [43]:
selected_columns = ['breakfast_eaten_at_home_less_5_days_a_week_child', 'emotional_behavioural_problems_child', 'adult_diabetes_rates', 'depression_child', 'unfilled_prescription_due_to_cost_child', 'unmet_need_for_after_hours_due_to_cost_child', 'unmet_need_for_GP_due_to_cost_child']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('childeren health',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> 'vegetable_and_fruit_intake_child', 'healthy_weight_child' are related. 'vegetable_and_fruit_intake_child' decreased from 2016 to 2018. 'excellent_very_good_or_good_parent_rated_health_child' is high, but actually the parents should care that their children should take more vegetables and fruits.

-> both 'breakfast_eaten_at_home_less_5_days_a_week_child' and 'emotional_behavioural_problems_child' increased slowly from 2016 to 2018; seems that the parents should care more about the children from food to emotion.

-> 'unfilled_prescription_due_to_cost_child', 'unmet_need_for_after_hours_due_to_cost_child', 'unmet_need_for_GP_due_to_cost_child' slowly decreased from 2016 to 2018. That means more support from the government for children are needed.

-> 'depression_child' is relative stable and slowly decreased from 2016 to 2018. Is this real? maybe some cases are not detected because of unfilled prescription etc.

Group 4: others¶

In [44]:
selected_columns = ['suicides_per_100000', 'problem_gambling_intervention_prevelance_percent']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('suicide and gambling percentage',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

7. Safety and security¶

Group 1: prisoners in remand, sentenced and post sentenced¶

In [45]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['total_prisoners_in_remand_rate', 'total_sentenced_prisoners_rate', 'total_post_sentence_offender_population_rate']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('prisoners in remand, sentenced and post sentenced',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> 'total_prisoners_in_remand_rate' and 'total_post_sentence_offender_population_rate' have similar trend. So we can also see the 'total_sentenced_prisoners_rate' is relative stable in between. -> overall we see the 'total_prisoners_in_remand_rate' increase, that means the overall saftey and security decreased, especially from 2015 to 2018.

Group 2: violance and murder¶

In [46]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['violent_crime_victimisations_rate', 'recorded_murders_and_homicides_per_million']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('violance and murder',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> both decreased, so we see the extrem crime decreased¶

8. General inequality¶

In [47]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['regional_gdp_proportional_variation', 'difference_in_percent_for_low_income_by_gender', 'gender_pay_gap']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('General inequality',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [48]:
df.loc[19, 'gender_pay_gap'] = np.nan
df.loc[10, 'difference_in_percent_for_low_income_by_gender'] = np.nan
In [49]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['regional_gdp_proportional_variation', 'difference_in_percent_for_low_income_by_gender', 'gender_pay_gap']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('General inequality',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

-> Proportional variation is good, and 'gender_pay_gap' is becoming even smaller. These are good indicates. -> In EDA and data visualisation we primarily grouped and checked the data details to get primary overall findings; we correct some data at the same time.

Assumption based on the EDA and data visualisation¶

Although for each group there are many variables, through the sub-group of these variables of each group we can find the corelation between them for better understanding this group status and trends. In EDA and data visualisation we alreday got primary impression through sub-grouping. Based on the primary findings, in data analysis I believe that we can find more clear corelation of variables in same group through detailed corelation analysis of each group, and this will help us to focus on some key variables to simplify the analysis. This will also help us to understand each group deeper.

In EDA and data visualisation, we already see some relationship between the eight different groups. So when we select the representative variable of each group, we can furhter study the corelation of each group. This will help us to understand the reasons for changes of each group, to see how they influence each other. General assumptions are as below:

1.Labour is a very important indicator of economic status/level of a country and influences many other parts. New Zealand is a relative stable market with constent policy, and it is not a big country without mature or complete supply chain for most industries, so big changes of labour is mainly caused by external influences such like global financial crisis 2007-2009, Covid etc. Labour status also influences the gender inequality;

2.Besides the external influence, education is also quite related to labour, and of course also realted to income. The eduction system prepares and trains the future labor force;

3.People get income through labour, so labour status decides the income and long termly also decides the wealth;

4.If the labour staus is good, people would like to spend more money on housing and will have more money for health.

5.If people spend much money on housing and health but can not afford them with good labour, this will leads to poverty;

6.Poverty will leads to safety and security issues. Safety and security may also cause the poverty;

In data analysis we will check and adjust the assumption based on new and deeper data analysis. In the end we will try to provide related recommendations.

General flow chart based on above understanding, the corelation will be justified and adjusted through detailed data analysis in next step.

In [50]:
from IPython.display import Image
Image(filename="D:/a.jpg", width=800, height=400)
Out[50]:

Data Analysis¶

1. Labour¶

In [51]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['unemployment_rate', '60_64_years_unemp_rate', '65years_and_over_unemp_rate', 'underemployment_rate', 'labour_market_insecurity', 'long_term_unemployment_rate', 'pop_15_to_29_NEET']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Labour of Unemployment',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [52]:
columns_needed = ['unemployment_rate', '60_64_years_unemp_rate', '65years_and_over_unemp_rate', 'underemployment_rate', 'labour_market_insecurity', 'long_term_unemployment_rate', 'pop_15_to_29_NEET']
sub_data = df[columns_needed]
correlation_matrix = sub_data.corr()
correlation_matrix
Out[52]:
unemployment_rate 60_64_years_unemp_rate 65years_and_over_unemp_rate underemployment_rate labour_market_insecurity long_term_unemployment_rate pop_15_to_29_NEET
unemployment_rate 1.000000 0.298467 0.383727 0.806789 0.991128 0.839564 0.835874
60_64_years_unemp_rate 0.298467 1.000000 -0.164078 0.819555 0.851800 0.335302 0.605242
65years_and_over_unemp_rate 0.383727 -0.164078 1.000000 -0.147069 0.820517 0.243247 0.127590
underemployment_rate 0.806789 0.819555 -0.147069 1.000000 0.785735 0.573200 0.667378
labour_market_insecurity 0.991128 0.851800 0.820517 0.785735 1.000000 0.450770 0.796450
long_term_unemployment_rate 0.839564 0.335302 0.243247 0.573200 0.450770 1.000000 -0.002802
pop_15_to_29_NEET 0.835874 0.605242 0.127590 0.667378 0.796450 -0.002802 1.000000

Conclusion 1:¶

Unemployment was continously decreased from 1993 to 2008, but increased from 2008 mainly caused by global financial crisis in 2008, an then decreased again from 2013. When financial crisis happened, the young group is most influenced.

-> "unemployment rate" and "long term unemployment rate" are two most important indicators. Both decreased from 1993/1994 and increased from 2008/2009. Because the "long-term unemployment" refers to people who have been unemployed for 12 months or more, we see the peak and turning point is one year later compared to "unemployment", but still with 0.84 corelation;

-> elderly group "60-65" umemployment is closely related to overall "unemployment rate", while "over 65" is not that senstive and relative stable as most of them are retired; corelation to "unemployment rate" is relatvie low with around 0.3;

-> young group "15-29 years" also follow the trend of "unemployment rate" closely with 0.84 corelation, and even more senstive in 2008 (sudden big increase);

-> "labour market insecurity" exactly followed the trend of "unemployment rate" with 0.99 corelation. This is reasonable. "underemployment rate" is relative stable when "unemployment rate" changes, as they only work as part time job, with 0.81 corelation.

In [53]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['employees_working_long_hours', 'low_pay_percent_OECD_def', 'low_pay_percent_120percent_MW_def','living_min_wage_gap_magnitude', 'LIS_ms-16', 'LP:RPW_ms-16']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Labour of Employment',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [54]:
from scipy.stats import linregress
non_empty_values = df['employees_working_long_hours'].dropna().tolist()
x = np.arange(len(non_empty_values))
y = np.array(non_empty_values)
slope, intercept, r_value, p_value, std_err = linregress(x, y)
plt.plot(x, y, color='red', label='Data')
plt.plot(x, slope * x + intercept, color='blue')
plt.title('Employees with long working hours_Regression line',fontsize=18)
Out[54]:
Text(0.5, 1.0, 'Employees with long working hours_Regression line')
In [55]:
mean1=df['low_pay_percent_OECD_def'].mean()
mean1
Out[55]:
13.030602276190477

Conclusion 2:¶

Internal structure in the "employment" is relative stable even "unemployment rate" changes rapidly.

-> according to above regression line we can see that the "employees with long working hours " overall decresed, which means the working condition has been improved. "low pay percentage OECD" is relative stable with result around 13;

->"low pay pecentage compared to 120%" does not have many data, so will not analyze it

-> "Labour share of income in MS-16 industries" is around 55%. This means ths MS-16 industries contribute more than half of the labour and is very stable. Same stable as "Ratio of labour productivity (indexed) to real product wages (indexed) in MS-16 industries."

2. Education¶

In [56]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['bachelors_earning_premium_hourly', 'dips_cert_earning_premium_hourly', 'school_earning_premium_hourly', 'bachelors_earning_premium_weekly', 'dips_cert_earning_premium_weekly', 'school_earning_premium_weekly']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Earnings with different education',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [57]:
columns = ['bachelors_earning_premium_hourly', 'dips_cert_earning_premium_hourly', 'school_earning_premium_hourly']
mean1= df[columns].mean()
mean1
Out[57]:
bachelors_earning_premium_hourly    62.390785
dips_cert_earning_premium_hourly    29.214599
school_earning_premium_hourly        4.456910
dtype: float64

Conclusion 1:¶

-> Education degree is clearly related to earning. Behind the earning is the Labour. Higher education will have good job and will have high income.

-> for hourly earnings, bachelor can earn on average 62 NZD and diploma/certificate earn on average 29 NZD, while school qualification only earn on average 4.5 NZD, which means many of them do not have a job.

In [58]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['teriary_education_participation_rate','education_spend_as_percent_of_GDP', 'education_spend_as_percent_of_total_gov_expenses']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Education investment and affordability',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [59]:
columns_needed = ['teriary_education_participation_rate','education_spend_as_percent_of_GDP', 'education_spend_as_percent_of_total_gov_expenses']
sub_data = df[columns_needed]
correlation_matrix = sub_data.corr()
correlation_matrix
Out[59]:
teriary_education_participation_rate education_spend_as_percent_of_GDP education_spend_as_percent_of_total_gov_expenses
teriary_education_participation_rate 1.000000 0.475257 0.668082
education_spend_as_percent_of_GDP 0.475257 1.000000 0.718629
education_spend_as_percent_of_total_gov_expenses 0.668082 0.718629 1.000000

Conclusion 2:¶

-> Goverment investment on educaiton reached peak in year 2007 and then dreased in 2007-2009, which can also be related with financial crisis. Meanwhile the teriary education participation also decreased continously from 2006.

-> "education expenditure as percentage of total government expenses" and "Education expenditure as a percentage of GDP" are middle corelated to "teriary education participation" (0.4≀|r|<0.7), so the goverment investment on education is important for teriary education participation. To prepare the high level labour.

In [60]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = [ 'teriary_education_participation_rate','median_teriary_loan_as_perc_of_med_inc_AHC', 'median_teriary_loan_balance_as_perc_of_med_inc_AHC']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Teriary loan',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [61]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['university_affordability_ratio_to_mean_salary', 'polytechnics_affordability_ratio_to_mean_salary', 'wananga_affordability_ratio_to_mean_salary']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Education fee compared to mean earning',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

Conclusion 3:¶

From 2010 the loan compared to income is increasing quickly. Average university fee is always increasing after 2004. This further influnces the the teriary education participation, which decreased continously from 2006.

In [62]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['unemployment_rate','teriary_education_participation_rate']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Education fee compared to mean earning',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [63]:
df = df.reset_index(drop=True)
new_df = df.loc[12:34, ['teriary_education_participation_rate','unemployment_rate']]
correlation_matrix = new_df.corr()
correlation_matrix
Out[63]:
teriary_education_participation_rate unemployment_rate
teriary_education_participation_rate 1.000000 -0.838318
unemployment_rate -0.838318 1.000000
In [64]:
correlation_matrix = new_df.corr()
correlation_matrix
Out[64]:
teriary_education_participation_rate unemployment_rate
teriary_education_participation_rate 1.000000 -0.838318
unemployment_rate -0.838318 1.000000

Conclusion 4:¶

From 1999 to 2013, when the "teriary education participation' increases, the "unemployment rate" decreases. And when the "teriary education participation' decreases, the "unemployment rate" increases. The turning point is 2008 financial crisis, which caused to less employment and less investment in education. Unemployment rate decrease from 2013 may have other reasons (maybe investement in industry or others)

Overall Conclusion:¶

overall the New Zealand government increased the investment on education, and it was influenced by financial crisis 2007-2009. But the loan percentage from 2010 and also the university fee from 2004 are always increasing, so goverment needs to provide more support for education espcially for university education. To provide support for education cost, if they want to increase the teriary education participation. And the the teriary education participation can provide high quality labour and increase the labour income.

3. General inequality¶

In [65]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['regional_gdp_proportional_variation', 'difference_in_percent_for_low_income_by_gender', 'gender_pay_gap']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('General inequality',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [66]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['gender_pay_gap','unemployment_rate']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('General inequality',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [67]:
df = df.reset_index(drop=True)
new_df = df.loc[16:34, ['unemployment_rate','gender_pay_gap']]
correlation_matrix = new_df.corr()
correlation_matrix
Out[67]:
unemployment_rate gender_pay_gap
unemployment_rate 1.000000 0.171208
gender_pay_gap 0.171208 1.000000

Conclusion:¶

"gender pay gap" also follows the "unemployment rate" trend. When the "unemployment rate" changes, the "gender pay gap"also changes a few years later. When the unemployment rate decreases that means more labour is needed, so the woman can also get a good pay. In difficult time when the umemployment rate increases, the gender pay gap will be bigger after a few years.

4. Income and Wealth¶

In [68]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['Q5:Q1', 'D10:D1', 'D10:D1-4(Palma)','P90:P10_bhc', 'P80:P20_bhc','GINI-BHC']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Income and wealth',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [69]:
columns_needed = ['GINI-BHC','Q5:Q1', 'D10:D1', 'D10:D1-4(Palma)','P90:P10_bhc', 'P80:P20_bhc']
sub_data = df[columns_needed]
correlation_matrix = sub_data.corr()
correlation_matrix
Out[69]:
GINI-BHC Q5:Q1 D10:D1 D10:D1-4(Palma) P90:P10_bhc P80:P20_bhc
GINI-BHC 1.000000 0.977133 0.944154 0.993647 0.920902 0.933327
Q5:Q1 0.977133 1.000000 0.980766 0.974091 0.935184 0.908565
D10:D1 0.944154 0.980766 1.000000 0.952201 0.889478 0.871680
D10:D1-4(Palma) 0.993647 0.974091 0.952201 1.000000 0.860994 0.878936
P90:P10_bhc 0.920902 0.935184 0.889478 0.860994 1.000000 0.929279
P80:P20_bhc 0.933327 0.908565 0.871680 0.878936 0.929279 1.000000

Conclusion 1:¶

Income disbribution is very corelated with GINI, so will only take GINI for further analysis.

In [70]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['GINI-BHC','unemployment_rate']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('GINI and unemployment rate',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [71]:
columns_needed = ['GINI-BHC','unemployment_rate']
sub_data = df[columns_needed]
correlation_matrix = sub_data.corr()
correlation_matrix
Out[71]:
GINI-BHC unemployment_rate
GINI-BHC 1.00000 -0.04316
unemployment_rate -0.04316 1.00000
In [72]:
df = df.reset_index(drop=True)
new_df = df.loc[25:34, ['GINI-BHC', 'unemployment_rate']]
correlation_matrix = new_df.corr()
correlation_matrix
Out[72]:
GINI-BHC unemployment_rate
GINI-BHC 1.000000 0.190525
unemployment_rate 0.190525 1.000000

Conclusion 2:¶

If we take all the years' data for analysis, we found that GINI has nearly no corelation with unemployment rate. If we take the data from 2008-2017 for analysis (GINI is lacked of data before 2008), we will find that they are weakly corelated with 0.2; So main Labour indicator (unemployment rate) seems weakly influnce the main income distribution indicator (GINI); The main reason could be that the GINI is sitll in good status under 0.4 (Gini coefficient between 0.3 and 0.4 is considered moderate inequality) and the unemployment also takes time to influence the GINI.

5. Housing¶

In [73]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = [ 'home_ownership_rate',  'house_median_multiples']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Housing',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [74]:
columns_needed = [ 'home_ownership_rate',  'house_median_multiples']
sub_data = df[columns_needed]
correlation_matrix = sub_data.corr()
correlation_matrix
Out[74]:
home_ownership_rate house_median_multiples
home_ownership_rate 1.000000 -0.893324
house_median_multiples -0.893324 1.000000

Conclusion 1:¶

'home ownership rate' and 'house median multiples' are very negative correlated (-0.89). This is reasonable. When people need to work more years to afford the housing purchasing price, the less will purchase or own a house.

In [75]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = [ 'unemployment_rate','national_house_purchasing_A30%_income_spend',  'debt_servicing_as_percent_of_disp_inc']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Unemployment, house purchasing spendure and debt',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [76]:
df = df.reset_index(drop=True)
new_df = df.loc[21:35, ['unemployment_rate','national_house_purchasing_A30%_income_spend',  'debt_servicing_as_percent_of_disp_inc']]
correlation_matrix = new_df.corr()
correlation_matrix
Out[76]:
unemployment_rate national_house_purchasing_A30%_income_spend debt_servicing_as_percent_of_disp_inc
unemployment_rate 1.000000 -0.161021 -0.680698
national_house_purchasing_A30%_income_spend -0.161021 1.000000 0.485398
debt_servicing_as_percent_of_disp_inc -0.680698 0.485398 1.000000

Conclusion 2:¶

"Percentage of households spending above 30% of income on housing" increase bigly before 2008 and decrease bigly after 2008. Obviously this was caused by the financial crisis. "Unemployment rate" is middle corelated with "Percentage of disposable income spent on household debt". When unemployment decreased, people are optimistic and will have higher loan to purchase the housing.

6. Health¶

In [77]:
selected_columns = ['health_expenditure_per_capita_PPP']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('health expenditure per capita PPP',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [78]:
selected_columns = ['health_expenditure_as_percent_of_gdp']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('health expenditure as percent of gdp',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

Conclusion 1:¶

from 2001 to 2016, nearly double of the expenditure per capita, and big increase of pecentage in GDP;

In [79]:
selected_columns = ['depression_adult', 'psychological_distress_adult', 'mood_anxiety_disorder_adult', 'healthy_weight_adult',  'veg_and_fruit_intake_adult','adult_diabetes_rates']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('adult health',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [80]:
df = df.reset_index(drop=True)
new_df = df.loc[30:35, [ 'veg_and_fruit_intake_adult','healthy_weight_adult', 'depression_adult', 'psychological_distress_adult', 'mood_anxiety_disorder_adult', 'adult_diabetes_rates']]
correlation_matrix = new_df.corr()
correlation_matrix
Out[80]:
veg_and_fruit_intake_adult healthy_weight_adult depression_adult psychological_distress_adult mood_anxiety_disorder_adult adult_diabetes_rates
veg_and_fruit_intake_adult 1.000000 0.936210 -0.808740 -0.913669 -0.895202 -0.257291
healthy_weight_adult 0.936210 1.000000 -0.803705 -0.911935 -0.875513 -0.215869
depression_adult -0.808740 -0.803705 1.000000 0.834509 0.956041 -0.308078
psychological_distress_adult -0.913669 -0.911935 0.834509 1.000000 0.844995 0.193556
mood_anxiety_disorder_adult -0.895202 -0.875513 0.956041 0.844995 1.000000 -0.192226
adult_diabetes_rates -0.257291 -0.215869 -0.308078 0.193556 -0.192226 1.000000

Conclusion 2:¶

"vegetable and fruit intake" and "healthy weight" are quite corelated, and highly negatively corelated with "depression" , "psychological distress" and "mode of anxiety disorder" . So that means for adult, less vegetable and fruit intake leads to bigger weight and related psychological health problems, but not so much to "adult diabetes";

In [81]:
selected_columns = [ 'vegetable_and_fruit_intake_child', 'healthy_weight_child','excellent_very_good_or_good_parent_rated_health_child','breakfast_eaten_at_home_less_5_days_a_week_child', 'emotional_behavioural_problems_child',  'depression_child']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Children health',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [82]:
df = df.reset_index(drop=True)
new_df = df.loc[30:35, [ 'vegetable_and_fruit_intake_child', 'healthy_weight_child','breakfast_eaten_at_home_less_5_days_a_week_child', 'emotional_behavioural_problems_child',  'depression_child']]
correlation_matrix = new_df.corr()
correlation_matrix
Out[82]:
vegetable_and_fruit_intake_child healthy_weight_child breakfast_eaten_at_home_less_5_days_a_week_child emotional_behavioural_problems_child depression_child
vegetable_and_fruit_intake_child 1.000000 0.215014 -0.874686 -0.643957 0.367604
healthy_weight_child 0.215014 1.000000 0.239368 -0.667914 -0.545509
breakfast_eaten_at_home_less_5_days_a_week_child -0.874686 0.239368 1.000000 0.400366 -0.528145
emotional_behavioural_problems_child -0.643957 -0.667914 0.400366 1.000000 0.024971
depression_child 0.367604 -0.545509 -0.528145 0.024971 1.000000

Conclusion 3:¶

"vegetable and fruit intake" and "breakfast eaten at home less than 5 days a week" are quite corelated, and highly negatively corelated with "emotional behavioural problems" . That means if children who eat less fruit or vegetable will fewer eat breakfast at home and this could leads to emotional behavioural problems. Different from adult, children who eat less fruit or vegetable is not so related to depression or health weight. Maybe children are not so rely on healthy food compared to adults.

In [83]:
selected_columns = ['unmet_after_hours_care_due_to_cost_adult', 'unmet_GP_care_due_to_cost_adult','unfilled_prescription_due_to_cost_child', 'unmet_need_for_after_hours_due_to_cost_child', 'unmet_need_for_GP_due_to_cost_child']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('unmet health care',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

Conclusion 4:¶

compared to adult, health support for children is improving.

In [84]:
selected_columns = ['depression_adult','long_term_unemployment_rate']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('adult health',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [85]:
df = df.reset_index(drop=True)
new_df = df.loc[30:35, [ 'depression_adult','long_term_unemployment_rate']]
correlation_matrix = new_df.corr()
correlation_matrix
Out[85]:
depression_adult long_term_unemployment_rate
depression_adult 1.000000 0.904932
long_term_unemployment_rate 0.904932 1.000000

Conclusion 5:¶

-> Long-term unemployment refers to people who have been unemployed for 12 months or more. The long-term unemployment rate shows the proportion of these long-term unemployed among all unemployed. Long-term unemployment causes significant mental and material stress for those affected and their families. It is also of particular concern for policy makers, as high rates of long-term unemployment indicate that labour markets are operating inefficiently. This indicator is measured as a percentage of unemployed. "https://data.oecd.org/unemp/long-term-unemployment-rate.htm"

Overall Conclusion:¶

-> vegetable and fruit intake is much more important for adults than children, which can help control the weight and reduce the pyschological problems.

-> vegetable, fruit and breakfast is also important for children, maybe not that important as for adults.

-> adults depression rate is around 15% and still increasing. "long term unemployment rate" is quite related to this. So we should care much more of the psychological status of long term unemployed people, and provide more health support to this group.

7. Poverty¶

In [86]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = [ 'LIH_B50_CV_AHC', 'LIH_B60_CV_AHC',"LIEH_B50_CV_median_AHC","LIEH_B60_CV_median_AHC","child_poverty_ahc_perc_HH_B50_CV","child_poverty_ahc_perc_HH_B60_CV"]
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Poverty compared to 2007',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [87]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = [ 'LIH_B50_contemp_median_AHC', 'LIH_B60_contemp_median_AHC',"LIEH_B50_contemp_median_AHC","LIEH_B60_contemp_median_AHC","child_poverty_ahc_perc_HH_B50_contemp_median","child_poverty_ahc_perc_HH_B60_contemp_median"]
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Poverty compared to contemporary median',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [88]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = [ 'LIS65H_B50_CV_median_AHC', 'LIS65H_B60_CV_median_AHC',"LISPH_B50_CV_median_AHC","LISPH_B60_CV_median_AHC"]
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Poverty risk ratio for Single Under 65 and for solo parent',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()

Conclusion 1:¶

-> either we compare to 2007 median income or compare to contemporary median, we find that the elderly group's situation is best, then is the general households, the worst situation is the families with children. So more support for children and families with children is needed.

In [89]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['rate_of_personal_insolvencies']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('personal insolvencies',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [90]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['loan_delinquency_percent','unemployment_rate']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Unemployment and loan delinquency percentage',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [91]:
df = df.reset_index(drop=True)
new_df = df.loc[25:34, ['unemployment_rate','loan_delinquency_percent','rate_of_personal_insolvencies']]
correlation_matrix = new_df.corr()
correlation_matrix
Out[91]:
unemployment_rate loan_delinquency_percent rate_of_personal_insolvencies
unemployment_rate 1.000000 0.256994 0.007644
loan_delinquency_percent 0.256994 1.000000 0.867414
rate_of_personal_insolvencies 0.007644 0.867414 1.000000
In [92]:
df = df.reset_index(drop=True)
new_df = df.loc[25:34, ['loan_delinquency_percent','rate_of_personal_insolvencies']]
correlation_matrix = new_df.corr()
correlation_matrix
Out[92]:
loan_delinquency_percent rate_of_personal_insolvencies
loan_delinquency_percent 1.000000 0.867414
rate_of_personal_insolvencies 0.867414 1.000000

Conclusion 2:¶

-> obviousely there is sudden big increase of "loan delinquency percentage" and "personal insolvencies" in 2008, which is mainly caused by financial crisis in 2008. There is some corelation between "unemployment rate" and "loan delinquency percentage". The "loan delinquency percentage" and "personal insolvencies" are quite corelated.

8. Safety and security¶

In [93]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['total_prisoners_in_remand_rate', 'total_sentenced_prisoners_rate', 'total_post_sentence_offender_population_rate']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Safety and security',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [94]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['violent_crime_victimisations_rate', 'recorded_murders_and_homicides_per_million']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('Safety and security',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [95]:
df['year'] = pd.to_datetime(df['year'])
selected_columns = ['rate_of_personal_insolvencies','total_prisoners_in_remand_rate']
for column in selected_columns:
    plt.plot(df['year'], df[column], marker='o', linestyle='-', label=column)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Value',fontsize=16)
plt.title('rate of personal insolvencies and total prisoners in remand',fontsize=18)
plt.legend()
plt.grid(True)
plt.show()
In [96]:
df = df.reset_index(drop=True)
new_df = df.loc[25:34, ['total_prisoners_in_remand_rate','rate_of_personal_insolvencies','loan_delinquency_percent']]
correlation_matrix = new_df.corr()
correlation_matrix
Out[96]:
total_prisoners_in_remand_rate rate_of_personal_insolvencies loan_delinquency_percent
total_prisoners_in_remand_rate 1.000000 0.588088 0.577912
rate_of_personal_insolvencies 0.588088 1.000000 0.867414
loan_delinquency_percent 0.577912 0.867414 1.000000

Conclusion:¶

-> obviousely there is an overall increasing trend of "total prisoners in remand rate" which shows overall safety is decreasing. But the serious crime such like "violent_crime" and "murder and and homicides" is decreasing.

-> "total prisoners in remand rate" is moderate corelated with "rate of personal_insolvencies" and "loan delinquency percent". So the poverty is moderate corelated with safety and security.

Total conclusion¶

-> For the 8 groups, tried to sub-group the variables to understand each group more logically and clearly for valuable findings. Meanwhile also tried to provide recommendations to the policy markers based on the findings. More details please see "Key findings for each group" below.

-> Also tried to find the corelation of "Labor" with other 7 groups, as "Labor" is a very key indicator which is ofen used to indicate the economic status of one country, and it also influences many other indicators such like "income, housing, poverty" etc. For each group we find one or two representative variables for corelation analysis. This is not a perfect way, but it is interesting to find some potential relationships between them. This helps us to understand the reasons for changes of each group variables. So each group is not independent and influences each other.

Key findings fo each group¶

-> Group 1 "Labour": Unemployment was continously decreased from 1993 to 2008, but increased from 2008 which was mainly caused by global financial crisis in 2008, and then decreased again from 2013. When financial crisis happened, the labour of young group (15-29 years old) is most influenced. "Labour share of income in MS-16 industries" is around 55%, which is more than half.

-> Group 2 "Education": It is very clear that education degree is closely related to earnings/income. Behind the earning/income is the "Labour" level. The teriary education participation rate continously increased before 2006, along with the continous decrease of "unemployment rate". This is a very good corelation, but from 2007-2008 both of their good trends have been changed. The "unemployment rate" decreased again from 2013, but the "teriary education participation rate" keeps decreasing without any change. The even higher education fee especially for university study and accordingly the even increasing teriary loan can be the main reason, although the goverment budget for eduction is also increasing (but the increasing is not that big compared to teriary loan and univeristy fee increase). So more support for the teriary education is needed: either more budget for teriary education to reduce the study fee or better policy for teriary loan etc. The teriary education participation rate will also contribute to the "Labour" and consequentially to othe parts such like "income, housing....", so high level education is very important and normally high degree leads to higher earnings.

-> Group 3 "General inequality": "gender pay gap" is a key indicator in this group. It follows the "unemployment rate" trend. When the "unemployment rate" changes, the "gender pay gap"also changes a few years later. When the "unemployment rate" decreases that means more labour is needed, so the woman can also get a good pay. In difficult time when the "umemployment rate" increases, the gender pay gap will become bigger a few years later. Of course "gender pay gap" is also influenced by others such like society development, social recognition of gender equality etc.

-> Group 4 "Income and Wealth": Income disbribution is very corelated with GINI, so will only take GINI for further analysis. If we take all the years' data for analysis, we found that GINI has nearly no corelation with "unemployment rate". If we take the data from 2008-2017 for analysis (not enough data for GINI before 2008), we will find that they are weakly corelated with 0.2. So main indicator for "Labour" (unemployment rate) will weakly influnce the main "Income and Wealth" indicator (GINI); The main reason could be that the GINI is sitll in good status under 0.4 (Gini coefficient between 0.3 and 0.4 is considered moderate inequality) and the "unemployment rate" may also take time to influence the GINI . Even during difficult time when "unemployment rate" is increasing, GINI does not change much. In general we see GINI is sligtly increasing along from 1982 with 0.27 to 2018 with 0.34. So more investigation and data is needed to find the reasons behind it.

-> Group 5 "Housing": 'home ownership rate' and 'house median multiples' are very negative correlated (-0.89). This is reasonable. When people need to work more years to afford the housing purchasing price, less people will purchase or own a house."Percentage of households spending above 30% of income on housing" increase bigly before 2008 and decrease bigly after 2008. Obviously this change was caused by the financial crisis. "Unemployment rate" is middle corelated with "Percentage of disposable income spent on household debt". When unemployment decreased, people are optimistic and are willing to afford higher loan to purchase the housing.

-> Group 6 "Health": From 2001 to 2016, nearly double of the health expenditure per capita and big increase of Health expenditure as a percentage of GDP; For adult "vegetable and fruit intake" and "healthy weight" are quite corelated, and highly negatively corelated with "depression" , "psychological distress" and "mode of anxiety disorder" . So that means for adult, less vegetable and fruit intake leads to bigger weight and related psychological health problems; "depression" is good corelated with "long-term unemployment rate". "long-term unemployment rate" refers to people who have been unemployed for 12 months or more and causes significant mental and material stress for those affected and their families (https://data.oecd.org/unemp/long-term-unemployment-rate.htm"). For children "vegetable and fruit intake" and "breakfast eaten at home less than 5 days a week" are quite corelated, and highly negatively corelated with "emotional behavioural problems". That means if children who eat less fruit or vegetable will fewer eat breakfast at home and this could leads to emotional behavioural problems. Different from adult, children who eat less fruit or vegetable is not so related to depression or health weight. Maybe children do not so rely on healthy food compared to adults. Compared to adult, health support for children is improving. So in general, we should encourage children and especially adults to eat more vegetables and fruits for physical and pyschological health, as both of them eat less now. Adults depression rate is around 15% and still increasing. "long term unemployment rate" is quite related to this. So we should care much more of the psychological status of long term unemployed people, and provide more health support to this group.

-> Group 7 "Poverty": Either we compare to 2007 median income or compare to contemporary median, we find that the elderly group's situation is best, then is the general households, the worst situation is the families with children. So more support for children and families with children is needed. Obviousely there is sudden big increase of "loan delinquency percentage" and "personal insolvencies" in 2008, which is mainly caused by financial crisis in 2008. There is some corelation between "unemployment rate" and "loan delinquency percentage", that means "Labour" status is also related to "Poverty".

-> Group 8 "Safety and security": Obviousely there is an overall increasing trend of "total prisoners in remand rate" which shows overall safety is decreasing. But the serious crime such like "violent_crime" and "murder and and homicides" is decreasing. "total prisoners in remand rate" is moderate corelated with "rate of personal insolvencies" and "loan delinquency percent". So "Safety and security" is moderate corelated with "Poverty". It is also influenced by many other sides such like wealthfare system, police system and education etc. , Poverty is just one reason of it.

General overall corelation map¶

In [97]:
from IPython.display import Image
Image(filename="D:/b.jpg", width=800, height=400)
Out[97]:

-> we just chosed one or two key representative avriables of each group. One reason is to simplify the analysis, another reason is that different variable has different quantity of data. Simplily interpolation to get more data to compare all the variables at the same time will decrease the accuracy, as some variables have only very limited data.

-> we mainly analyze the corelation with "Labour", as "Labour" is very important indicator and directly influence other groups of variables. Of course there should also be corelation between other groups: "Income and wealth" with "Housing" and "Poverty" etc.

-> as "Income and wealth" here mainly talks about income distribution such like GINI, the corelation with "Labour" is weak. But if we talk about the exact income level, "Labour" should be quite related.

-> if we have more data or more variables, such like exact income/salary etc., it will help to find the corelation, but because of time reason, did not collect extra data for deeper analysis. The trail for corelation is also an interesting journey, alough that will not be easy and the result still needs to be further investigated.

Bibliography¶

1.SuĆĄnjak, T., Schumacher, C., Ali, A., Brook, A., Geertsema, P., Matthewson, J., Owens, R., & Smith, J. (2019b). Towards a global index of shared prosperity: a case study on New Zealand. https://doi.org/10.33217/keh/sharedprosperity/001/05.2019

2.Unemployment - Long-term unemployment rate - OECD Data. (n.d.-b). theOECD. https://data.oecd.org/unemp/long-term-unemployment-rate.htm

3.Chen, J. (2022, September 26). Underemployment: definition, causes, and example. Investopedia. https://www.investopedia.com/terms/u/underemployment.asp#:~:text=The%20U.S.%20unemployment%20rate%20was%2013.3%25%20as%20of,is%20the%20sum%20of%20the%20employed%20and%20unemployed%29.%22

In [ ]: